13 research outputs found

    Linear-time String Indexing and Analysis in Small Space

    The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T \in \{1, \ldots, \sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n \log \sigma)$ bits of space, where $\sigma$ [...] We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n \log \sigma)$ bits of space. The previously fastest construction algorithms for the BWT, the compressed suffix array and the compressed suffix tree, which used $O(n \log \sigma)$ bits of space, took $O(n \log \log \sigma)$ time for the first two structures and $O(n \log^{\epsilon} n)$ time for the third, where $\epsilon$ is any positive constant smaller than one. Alternatively, the BWT could previously be built in linear time if one was willing to spend $O(n \log \sigma \log \log_{\sigma} n)$ bits of space. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output. Peer reviewed.
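For readers unfamiliar with the transform this abstract builds on, the following is a minimal sketch of the BWT and its inverse. It is a naive construction that sorts all rotations, which takes quadratic space and more than linear time, so it is not the linear-time, small-space algorithm the abstract reports. The function names and the `$` sentinel are assumptions chosen for illustration.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column.

    Illustrative only; assumes `sentinel` is absent from `text` and sorts
    before every character of `text`.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Naive inversion: repeatedly prepend the BWT column and re-sort."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The rotation that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)               # annb$aa
    print(inverse_bwt(encoded))  # banana
```

Runs of equal characters in the output (such as `nn` and `aa` above) are what make the transform useful for compressed indexes.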

    Patterns of genetic variation in leading-edge populations of Quercus robur : genetic patchiness due to family clusters

    The genetic structure of populations at the edge of a species' distribution is important for the species' adaptation to environmental changes. Small populations may experience non-random mating and differentiation due to genetic drift, but larger populations, too, may have a low effective size, e.g., due to within-population structure. We studied the spatial population structure of pedunculate oak, Quercus robur, at the northern edge of the species' global distribution, where oak populations are experiencing rapid climatic and anthropogenic changes. Using 12 microsatellite markers, we analyzed genetic differentiation among seven small to medium-sized populations (census sizes 57-305 reproducing trees) and examined four populations for within-population genetic structure. Genetic differentiation among the seven populations was low (Fst = 0.07). We found a strong spatial genetic structure in each of the four populations. Spatial autocorrelation was significant in all populations, and its intensity (Sp) was higher than that reported for more southern oak populations. Significant genetic patchiness was revealed by Bayesian structuring, and a high number of spatially aggregated full and half sibs was detected by sibship reconstruction. A meta-analysis of isoenzyme and SSR data extracted from the (GD)^2 database suggested a northward-decreasing trend in expected heterozygosity and in the effective number of alleles, thus supporting the central-marginal hypothesis in oak populations. We suggest that the fragmented distribution and the location of Finnish pedunculate oak populations at the species' northern margin facilitate the formation of within-population genetic structures. Information on the existence of spatial genetic structures can help conservation managers to design gene conservation activities and to avoid overly strong family structures when sampling seeds and cuttings for afforestation and tree improvement purposes. Peer reviewed.

    On Suffix Tree Breadth

    The suffix tree (the compacted trie of all the suffixes of a string) is the most important and widely used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string $S$ of length $n$, how many nodes $\nu_S(d)$ can there be at (string) depth $d$ in its suffix tree? We prove that $\nu(n,d) = \max_{S \in \Sigma^n} \nu_S(d)$ is $O((n/d) \log n)$, and show that this bound is almost tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d) \log(n/d))$.
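The quantity studied here can be explored on small strings without building a suffix tree. The sketch below counts branching (internal) nodes at string depth d by brute force, using the fact that a substring labels an internal node exactly when it is followed by at least two distinct characters once a unique terminator is appended. Whether leaves are also counted is a convention the abstract does not state, so this counts internal nodes only. The function name and the `$` terminator are assumptions for illustration.

```python
from collections import defaultdict


def internal_nodes_at_depth(s, d, terminator="$"):
    """Count internal suffix-tree nodes of s + terminator at string depth d.

    Brute force over all length-d substrings; assumes `terminator` does not
    occur in `s`. A substring labels an internal node if and only if it is
    followed by two or more distinct characters.
    """
    t = s + terminator
    followers = defaultdict(set)
    for i in range(len(t) - d):
        # t[i:i + d] never contains the terminator, because i + d < len(t).
        followers[t[i:i + d]].add(t[i + d])
    return sum(1 for chars in followers.values() if len(chars) >= 2)


if __name__ == "__main__":
    # The suffix tree of "banana$" has internal nodes root, "a", "na", "ana".
    for depth in range(5):
        print(depth, internal_nodes_at_depth("banana", depth))
```

This takes roughly O(nd) time per depth, which is fine for experimenting with short strings but says nothing about how the bounds in the abstract are proved.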

    Storage Efficient Substring Searchable Symmetric Encryption

    We address the problem of substring searchable encryption. A single user produces a large stream of data and later wants to learn the positions in the string at which some patterns occur. Although current techniques exploit auxiliary data structures to achieve efficient substring search on the server side, the cost on the user side may be prohibitive. We revisit prior work on substring searchable encryption in order to reduce the storage cost of the auxiliary data structures. Our solution entails a suffix-array-based index design, which achieves an optimal storage cost of O(n), with a small hidden constant factor, in the size n of the string. We analyze the security of the protocol in the real/ideal framework. Moreover, we implemented our scheme and the state-of-the-art protocol [7] to demonstrate the performance advantage of our solution with precise benchmark results.
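To make the underlying index concrete, here is a minimal plaintext sketch of substring search with a suffix array. It involves no encryption and is not the protocol from the paper. It only shows how an O(n)-entry array of suffix start positions answers "where does this pattern occur" with two binary searches. The function names are hypothetical, and the construction shown is a simple sort rather than an efficient builder.

```python
def build_suffix_array(text):
    """Suffix start positions in lexicographic order of the suffixes.

    Simple sort-based construction, for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Each query compares at most O(log n) suffix prefixes of length m against the pattern, which is why the suffix array is a natural base for a storage-lean index.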

    Evaluating housing quality, health and safety using an Internet-based data collection and response system: a cross-sectional study

    Background: Housing and health surveys are typically not integrated and are therefore not representative of population health or national housing stocks. In addition, the existing channels for distributing information about housing and health issues to the general public are limited. The aim of this study was to develop a data collection and response system that would allow us to assess the Finnish housing stock from the points of view of quality, health and safety, and also to provide a tool for distributing information about important housing health and safety issues. Methods: The data collection and response system was tested with a sample of 3000 adults (one per household), who were randomly selected from the Finnish Population Register Centre. Spatial information about the exact location of the residences (i.e. coordinates) was included in the database inquiry. People could participate either by completing and returning a paper questionnaire or by completing the same questionnaire via the Internet. The respondents did not receive any compensation for their time in completing the questionnaire. Results: This article describes the data collection and response system and presents the main results of the population-based testing of the system. A total of 1312 people (response rate 44%) answered the questionnaire, though only 80 answered via the Internet. A third of the respondents indicated that they wanted feedback. Although a majority (>90%) of the respondents reported being satisfied or quite satisfied with their residence, a number of prevalent housing issues were identified that can be related to health and safety. Conclusions: The collected database can be used to evaluate the quality of the housing stock in terms of occupant health and safety, and to model its association with occupant health and well-being. However, it must be noted that all the health outcomes gathered in this study are self-reported. A follow-up study is needed to evaluate whether the occupants acted on the feedback they received. Relying solely on an Internet-based questionnaire for collecting data would not appear to provide an adequate response rate for random population-based surveys at this point in time.

    Episode Matching

    Given two words, text T of length n and episode P of length m, the episode matching problem is to find all minimal length substrings of text T that contain episode P as a subsequence. The respective optimization problem is to find the smallest number w, s.t. text T has a subword of length w which contains episode P. In this paper, we introduce a few efficient off-line as well as on-line algorithms for the entire problem, where by on-line algorithms we mean algorithms which search from left to right consecutive text symbols only once. We present two alphabet independent algorithms which work in time O(nm). The off-line algorithm operates in O(1) additional space while the on-line algorithm pays for its property with O(m) additional space. Two other on-line algorithms have subquadratic time complexity. One of them works in time O(nm / log m) and O(m) additional space. The other one gives a time/space trade-off, i.e., it works in time O(n + s + nm log log s / log(s/m)) when additional s..
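The optimization version of the problem can be illustrated with a simple left-to-right scan that runs in O(nm) time and O(m) additional space. This is a textbook-style dynamic program written as a sketch under those bounds. It is not claimed to be any of the specific algorithms from the paper, and the function name is hypothetical.

```python
def min_episode_window(text, episode):
    """Length of the shortest substring of `text` containing `episode`
    as a subsequence, or None if there is no such substring.

    start[j] holds the latest start position of a window, ending at or before
    the current text position, that contains episode[0..j] as a subsequence
    (-1 if none exists yet).
    """
    m = len(episode)
    if m == 0:
        return 0
    start = [-1] * m
    best = None
    for i, ch in enumerate(text):
        # Go from the last episode symbol to the first so that start[j - 1]
        # still refers to text positions strictly before i.
        for j in range(m - 1, -1, -1):
            if ch == episode[j]:
                if j == 0:
                    start[0] = i
                elif start[j - 1] != -1:
                    start[j] = start[j - 1]
        if start[m - 1] != -1:
            width = i - start[m - 1] + 1
            if best is None or width < best:
                best = width
    return best


if __name__ == "__main__":
    print(min_episode_window("aabdbc", "abc"))  # 5, from the window "abdbc"
    print(min_episode_window("abc", "d"))       # None
```

Each text symbol is read once and only the array `start` is kept, which matches the on-line setting described in the abstract.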